Project Overview

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:

I went to the

the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.

Goal

The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.

The motivation for this project is to:

  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that you have amassed so far.
  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Review criteria

  1. Does the link lead to an HTML page describing the exploratory analysis of the training data set?
  2. Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?
  3. Has the data scientist made basic plots, such as histograms to illustrate features of the data?
  4. Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?

library(stringr)
library(dplyr)
library(quanteda)
library(readtext)
library(R.utils)
library(ggplot2)

set.seed(3301)

Task 0: Understanding the problem

Tasks to accomplish

  1. Obtaining the data - Can you download the data and load/manipulate it in R?
  2. Familiarizing yourself with NLP and text mining - Learn about the basics of natural language processing and how it relates to the data science process you have learned in the Data Science Specialization.

Questions to consider

  1. What do the data look like?
  2. Where do the data come from?
  3. Can you think of any other data sources that might help you in this project?
  4. What are the common steps in natural language processing?
  5. What are some common issues in the analysis of text data?
  6. What is the relationship between NLP and the concepts you have learned in the Specialization?

Download the data.

source("downloadData.R")

dataPath <- file.path("..", "data")
attach(downloadData(dataPath))
c(blogs, twitter, news, badwords)
## [1] "../data/final/en_US/en_US.blogs.txt"  
## [2] "../data/final/en_US/en_US.twitter.txt"
## [3] "../data/final/en_US/en_US.news.txt"   
## [4] "../data/bad-words.txt"

First, try processing the entire file using a from-scratch implementation:

tweets <- 0
wordsTwitter <- 0
sentencesTwitter <- 0
con <- file(twitter, "r")
while (length(oneLine <- readLines(con, 1, warn = FALSE)) > 0) {
        # Count tweet
        tweets <- tweets + 1
        # Show the first 10 tweets
        if(tweets <= 10) {
                print(oneLine)
        }
        # Tokenize by regular expression
        words <- str_split(oneLine, "\\s+")[[1]]
        # To detect symbol-only tokens like ':)', initialise a flag vector
        symbols <- rep(FALSE, length = length(words))
        # Each token:
        for(i in 1:length(words)) {
                # Flag tokens that contain only symbol characters
                symbols[i] <- grepl("^[^a-zA-Z0-9]+$", words[i])
                # Collapse number-only tokens into a '[numbers]' placeholder
                if(grepl("^[0-9]+$", words[i])) {
                        words[i] <- "[numbers]"
                }
        }
        # Keep only word tokens (drop symbol-only tokens)
        simpleWords <- words[!symbols]
        wordsPerLine <- length(simpleWords)
        # Count tokens ending with punctuation as the number of sentences
        for(i in seq_along(simpleWords)){
                if(grepl("[.!?]$", simpleWords[i])) {
                        sentencesTwitter <- sentencesTwitter + 1
                }
        }
        wordsTwitter <- wordsTwitter + wordsPerLine
        remove(simpleWords, words)
}
close(con)

tweets
wordsTwitter
sentencesTwitter
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [1] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [1] "they've decided its more fun if I don't."
## [1] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [1] "Words from a complete stranger! Made my birthday even better :)"
## [1] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
## [1] "i no! i get another day off from skool due to the wonderful snow (: and THIS wakes me up...damn thing"
## [1] "I'm coo... Jus at work hella tired r u ever in cali"
## [1] "The new sundrop commercial ...hehe love at first sight"
## [1] "we need to reconnect THIS WEEK"

## [1] 2360148
## [1] 29706404
## [1] 2818583

This takes a long time to run, so re-implement it using the quanteda package.

Load the file using the readtext package.

corpusTwitter <- readtext(twitter) %>%
        corpus()
summary(corpusTwitter)
## Corpus consisting of 1 document:
## 
##               Text  Types   Tokens Sentences
##  en_US.twitter.txt 566951 36719658   2588551
## 
## Source: /Users/warhol/Documents/!work/Data-Science-Capstone/MilestoneReport/* on x86_64 by warhol
## Created: Mon Jul 30 20:36:18 2018
## Notes: 

Task 1: Getting and cleaning the data

Tasks to accomplish

  1. Tokenization - identifying appropriate tokens such as words, punctuation, and numbers. Writing a function that takes a file as input and returns a tokenized version of it.
  2. Profanity filtering - removing profanity and other words you do not want to predict.
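
A minimal sketch of such a function, assuming the quanteda/readtext pipeline used later in this report (tokenizeFile is a hypothetical helper; profanity stands for the character vector of filtered words that is loaded further below):

tokenizeFile <- function(path, profanity) {
        toks <- readtext(path) %>%
                corpus() %>%
                tokens(remove_punct = TRUE,
                       remove_numbers = TRUE,
                       remove_url = TRUE)
        # drop profanity and any other words we do not want to predict
        tokens_remove(toks, profanity)
}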

Tips, tricks, and hints

  1. Loading the data in. This dataset is fairly large. We emphasize that you don’t necessarily need to load the entire dataset in to build your algorithms (see point 2 below). At least initially, you might want to use a smaller subset of the data. Reading in chunks or lines using R’s readLines or scan functions can be useful. You can also loop over each line of text by embedding readLines within a for/while loop, but this may be slower than reading in large chunks at a time. Reading pieces of the file at a time will require the use of a file connection in R. For example, the following code could be used to read the first few lines of the English Twitter dataset:

con <- file("en_US.twitter.txt", "r")
readLines(con, 1) ## Read the first line of text
readLines(con, 1) ## Read the next line of text
readLines(con, 5) ## Read in the next 5 lines of text
close(con)        ## It's important to close the connection when you are done

See the ?connections help page for more information.

  2. Sampling. To reiterate, to build models you don’t need to load in and use all of the data. Often relatively few randomly selected rows or chunks need to be included to get an accurate approximation to results that would be obtained using all the data. Remember your inference class and how a representative sample can be used to infer facts about a population. You might want to create a separate sub-sample dataset by reading in a random subset of the original data and writing it out to a separate file. That way, you can store the sample and not have to recreate it every time. You can use the rbinom function to “flip a biased coin” to determine whether you sample a line of text or not.

Sub-Sampling

tweets <- as.numeric(countLines(twitter))
twitterSubSampling <- file.path(dataPath, "sub-sample.twitter.txt")
if(!file.exists(twitterSubSampling)) {
        subSamplingRate <- .01
        flipABiasedCoin <- rbinom(tweets, size = 1, prob = subSamplingRate)
        conRead <- file(twitter, "r")
        conWrite <- file(twitterSubSampling, "w")
        len <- 0
        while (length(oneLine <- readLines(conRead, 1, warn = FALSE)) > 0) {
                len <- len + 1
                if(flipABiasedCoin[len] == 1) {
                        writeLines(oneLine, conWrite)
                }
        }
        close(conRead)
        close(conWrite)
}

subTweets <- as.numeric(countLines(twitterSubSampling))

Train and Test Data

twitterTrain <- file.path(dataPath, "train.twitter.txt")
twitterTest <- file.path(dataPath, "test.twitter.txt")
if(!file.exists(twitterTrain) || !file.exists(twitterTest)) {
        trainRate <- .7
        flipABiasedCoin <- rbinom(subTweets, size = 1, prob = trainRate)
        conRead <- file(twitterSubSampling, "r")
        conWriteTrain <- file(twitterTrain, "w")
        conWriteTest <- file(twitterTest, "w")
        len <- 0
        while (length(oneLine <- readLines(conRead, 1, warn = FALSE)) > 0) {
                len <- len + 1
                if(flipABiasedCoin[len] == 1) {
                        writeLines(oneLine, conWriteTrain)
                } else {
                        writeLines(oneLine, conWriteTest)
                }
        }
        close(conRead)
        close(conWriteTrain)
        close(conWriteTest)
}

trainTweets <- as.numeric(countLines(twitterTrain))
trainTweets
## [1] 16497

Tokenization

twitterCorpus <- readtext(twitterTrain) %>% 
        corpus()
summary(twitterCorpus)
## Corpus consisting of 1 document:
## 
##               Text Types Tokens Sentences
##  train.twitter.txt 27590 257031     18142
## 
## Source: /Users/warhol/Documents/!work/Data-Science-Capstone/MilestoneReport/* on x86_64 by warhol
## Created: Tue Jul 31 14:55:36 2018
## Notes:

Load Bad Words

profanity <- readLines(badwords)

Task 2: Exploratory Data Analysis

Tasks to accomplish

  1. Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.
  2. Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

Questions to consider

  1. Some words are more frequent than others - what are the distributions of word frequencies?
  2. What are the frequencies of 2-grams and 3-grams in the dataset?
  3. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
  4. How do you evaluate how many of the words come from foreign languages?
  5. Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?
Example n-gram sequences (field: computational linguistics; unit: word):

  Sample sequence:  … to be or not to be …
  1-gram sequence:  …, to, be, or, not, to, be, …
  2-gram sequence:  …, to be, be or, or not, not to, to be, …
  3-gram sequence:  …, to be or, be or not, or not to, not to be, …
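
The same sequences can be produced with quanteda's tokens_ngrams(), which is used for the 2-gram and 3-gram analyses below. A small illustration:

toks <- tokens("to be or not to be")
tokens_ngrams(toks, n = 2)  # to_be, be_or, or_not, not_to, to_be
tokens_ngrams(toks, n = 3)  # to_be_or, be_or_not, or_not_to, not_to_be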

Top 20

twitterToken <- twitterCorpus %>% 
        # normalize words
        tokens(remove_punct = TRUE,
               remove_numbers = TRUE,
               remove_url = TRUE,
               include_docvars = FALSE) %>%
        # removing profanity and other words
        tokens_remove(profanity)

twitterDfm <- twitterToken %>% 
        dfm()
twitterDfm %>%
        topfeatures(20)
##  the   to    i    a  you  and  for   in   of   is   it   my   on that   me 
## 6411 5433 5024 4342 3888 3144 2738 2589 2585 2418 2133 2062 1917 1626 1462 
##   be   at have with your 
## 1278 1268 1231 1218 1190

Plot Word Cloud

twitterDfm %>% 
        textplot_wordcloud(min_count = 6,
                           random_order = FALSE, 
                           rotation = .25,
                           color = RColorBrewer::brewer.pal(8, "Dark2"))

Normalize Words (Remove Stop Words)

twitterDfmNoStopWord <- twitterToken %>% 
        # removing stop words
        tokens_remove(stopwords('english')) %>%
        dfm()

Top 20 Normalized Words

twitterDfmNoStopWord %>%
        topfeatures(20)
##   just   like    get   good   love    can    day thanks     rt    now 
##   1041    847    823    758    717    655    641    630    602    594 
##    one  great    new   time      u   know  today     go    lol    see 
##    572    546    524    524    522    519    515    473    450    447

Plot Normalized Word Cloud

twitterDfmNoStopWord %>%
        textplot_wordcloud(min_count = 6,
                           random_order = FALSE,
                           max_words = 100,
                           rotation = .25,
                           color = RColorBrewer::brewer.pal(8, "Dark2"))

Normalized Word Frequency Plots

featuresTwitter <- twitterDfmNoStopWord %>%
        textstat_frequency(n = 80)

# Sort by reverse frequency order
featuresTwitter$feature <- featuresTwitter %>% 
        with(reorder(feature, -frequency))

featuresTwitter %>%
        ggplot(aes(x = feature, y = frequency)) +
        geom_point() + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1))

2-Gram Top 20

twitterDfm2Gram <- twitterToken %>%
        tokens_ngrams(n = 2) %>%
        dfm()
twitterDfm2Gram %>%
        topfeatures(20)
##     in_the    for_the     of_the     on_the      to_be thanks_for 
##        526        489        438        349        315        309 
##     to_the  thank_you     i_love     at_the     if_you     i_have 
##        296        262        248        241        238        226 
##       i_am     have_a   going_to      for_a     to_see    will_be 
##        225        219        215        211        196        196 
##       is_a    a_great 
##        190        189

2-Gram Word Cloud

twitterDfm2Gram %>%
        textplot_wordcloud(min_count = 6,
                           random_order = FALSE,
                           max_words = 100,
                           rotation = .25,
                           color = RColorBrewer::brewer.pal(8, "Dark2"))

2-Gram Frequency Plots

featuresTwitter2Gram <- twitterDfm2Gram %>%
        textstat_frequency(n = 80)

# Sort by reverse frequency order
featuresTwitter2Gram$feature <- featuresTwitter2Gram %>%
        with(reorder(feature, -frequency))

featuresTwitter2Gram %>%
        ggplot(aes(x = feature, y = frequency)) +
        geom_point() + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1))

3-Gram Top 20

twitterDfm3Gram <- twitterToken %>%
        tokens_ngrams(n = 3) %>%
        dfm()
twitterDfm3Gram %>%
        topfeatures(20)
##     thanks_for_the      thank_you_for         i_love_you 
##                144                 62                 59 
## looking_forward_to          i_want_to      can't_wait_to 
##                 59                 56                 54 
##        going_to_be         one_of_the       have_a_great 
##                 49                 49                 46 
##     for_the_follow           a_lot_of        is_going_to 
##                 43                 42                 41 
##         to_see_you          i_need_to          i_have_to 
##                 41                 41                 34 
##        let_me_know        a_great_day         you_have_a 
##                 33                 32                 32 
##           i_have_a       i'm_going_to 
##                 31                 31

3-Gram Word Cloud

twitterDfm3Gram %>%
        textplot_wordcloud(#min_count = 4,
                           random_order = FALSE,
                           max_words = 50,
                           rotation = .25,
                           color = RColorBrewer::brewer.pal(8, "Dark2"))

3-Gram Frequency Plots

featuresTwitter3Gram <- twitterDfm3Gram %>%
        textstat_frequency(60)

# Sort by reverse frequency order
featuresTwitter3Gram$feature <- featuresTwitter3Gram %>% 
        with(reorder(feature, -frequency))

featuresTwitter3Gram %>%
        ggplot(aes(x = feature, y = frequency)) +
        geom_point() +
        theme(axis.text.x = element_text(angle = 90, hjust = 1))

1-gram frequency percentiles:

featuresTwitterFull <- twitterDfmNoStopWord %>% textstat_frequency()
summary(featuresTwitterFull)
##    feature            frequency             rank          docfreq 
##  Length:20798       Min.   :   1.000   Min.   :    1   Min.   :1  
##  Class :character   1st Qu.:   1.000   1st Qu.: 5200   1st Qu.:1  
##  Mode  :character   Median :   1.000   Median :10400   Median :1  
##                     Mean   :   5.548   Mean   :10400   Mean   :1  
##                     3rd Qu.:   3.000   3rd Qu.:15599   3rd Qu.:1  
##                     Max.   :1041.000   Max.   :20798   Max.   :1  
##     group          
##  Length:20798      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
quantile(featuresTwitterFull$frequency, c(0, .1, .5, .9, 1))
##   0%  10%  50%  90% 100% 
##    1    1    1    8 1041
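
Question 3 under Task 2 asks how many unique words are needed to cover 50% and 90% of all word instances. Since textstat_frequency() returns features sorted by decreasing frequency, a cumulative sum answers this (a sketch; note this table excludes stop words and profanity, so the counts are not raw-text coverage):

# first rank at which cumulative coverage reaches 50% / 90% of tokens
coverage <- cumsum(featuresTwitterFull$frequency) / sum(featuresTwitterFull$frequency)
c(words50 = which(coverage >= 0.5)[1],
  words90 = which(coverage >= 0.9)[1])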

2-gram frequency percentiles:

featuresTwitter2GramFull <- twitterDfm2Gram %>% textstat_frequency()
summary(featuresTwitter2GramFull)
##    feature            frequency            rank           docfreq 
##  Length:119887      Min.   :  1.000   Min.   :     1   Min.   :1  
##  Class :character   1st Qu.:  1.000   1st Qu.: 29972   1st Qu.:1  
##  Mode  :character   Median :  1.000   Median : 59944   Median :1  
##                     Mean   :  1.703   Mean   : 59944   Mean   :1  
##                     3rd Qu.:  1.000   3rd Qu.: 89916   3rd Qu.:1  
##                     Max.   :526.000   Max.   :119887   Max.   :1  
##     group          
##  Length:119887     
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
quantile(featuresTwitter2GramFull$frequency, c(0, .1, .5, .9, 1))
##   0%  10%  50%  90% 100% 
##    1    1    1    2  526

3-gram frequency percentiles:

featuresTwitter3GramFull <- twitterDfm3Gram %>% textstat_frequency()
summary(featuresTwitter3GramFull)
##    feature            frequency            rank           docfreq 
##  Length:185306      Min.   :  1.000   Min.   :     1   Min.   :1  
##  Class :character   1st Qu.:  1.000   1st Qu.: 46327   1st Qu.:1  
##  Mode  :character   Median :  1.000   Median : 92654   Median :1  
##                     Mean   :  1.102   Mean   : 92654   Mean   :1  
##                     3rd Qu.:  1.000   3rd Qu.:138980   3rd Qu.:1  
##                     Max.   :144.000   Max.   :185306   Max.   :1  
##     group          
##  Length:185306     
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
quantile(featuresTwitter3GramFull$frequency, c(0, .1, .5, .9, 1))
##   0%  10%  50%  90% 100% 
##    1    1    1    1  144

The frequency distributions appear to be in accordance with Zipf’s law: most n-grams occur only once, while a handful are extremely frequent.
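
A quick visual check of this claim (a sketch using the unigram frequency table computed above): plotting frequency against rank on log-log axes should give a roughly straight line if Zipf's law holds.

featuresTwitterFull %>%
        ggplot(aes(x = rank, y = frequency)) +
        geom_point(alpha = .3) +
        scale_x_log10() +
        scale_y_log10()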

Task 3: Modeling

Tasks to accomplish

  1. Build basic n-gram model - using the exploratory analysis you performed, build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.
  2. Build a model to handle unseen n-grams - in some cases people will want to type a combination of words that does not appear in the corpora. Build a model to handle cases where a particular n-gram isn’t observed.

Questions to consider

  1. How can you efficiently store an n-gram model (think Markov Chains)?
  2. How can you use the knowledge about word frequencies to make your model smaller and more efficient?
  3. How many parameters do you need (i.e. how big is n in your n-gram model)?
  4. Can you think of simple ways to “smooth” the probabilities (think about giving all n-grams a non-zero probability even if they aren’t observed in the data) ?
  5. How do you evaluate whether your model is any good?
  6. How can you use backoff models to estimate the probability of unobserved n-grams?
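
On question 1, one possibility (a sketch only; not the approach used below, which queries the dfm objects directly) is to flatten each n-gram dfm into a lookup table keyed by its prefix, so that prediction becomes a simple filter:

trigramTable <- textstat_frequency(twitterDfm3Gram) %>%
        mutate(prefix = sub("_[^_]+$", "", feature),   # first two words
               nextWord = sub("^.*_", "", feature)) %>%
        select(prefix, nextWord, frequency)

trigramTable %>% filter(prefix == "thanks_for") %>% head(3)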

Good Turing Estimation

simpleGoodTuring <- function(r, Nr, sd = 1.65) {
        # number of words
        N <- sum(r * Nr)
        d <- diff(r)
    
        ## Turing estimate
        # turing estimate index
        ti <- which(d == 1)
        # discount coefficients of Turing estimate
        dct <- numeric(length(r))
        dct[ti] <- (r[ti] + 1) / r[ti] * c(Nr[-1], 0)[ti] / Nr[ti]

        ## Linear Good-Turing estimate
        Zr <- Nr / c(1, 0.5 * (d[-1] + d[-length(d)]), d[length(d)])
        f <- lsfit(log(r), log(Zr))
        coef <- f$coef
        # corrected term frequency
        rc <- r * (1 + 1 / r)^(1 + coef[2])  
        # discount coefficients of Linear Good-Turing estimate
        dclgt <- rc / r

        ## make switch from Turing to LGT estimates
        # approximate standard deviation of the Turing estimate, used for the switch criterion below
        rsd <- rep(1,length(r))        
        rsd[ti] <- (seq_len(length(r))[ti] + 1) / Nr[ti] * sqrt(Nr[ti + 1] * (1 + Nr[ti + 1] / Nr[ti]))
        
        dc <- dct
        for (i in 1:length(r)) {
            if (abs(dct[i] - dclgt[i]) * r[i] / rsd[i] <= sd) {
                dc[i:length(dc)] <- dclgt[i:length(dc)]
                break
            }
        }

        ## renormalize the probabilities for observed objects
        # summation of probabilities
        sump <- sum(dc * r * Nr) / N
        # renormalized discount coefficients
        dcr <- (1 - Nr[1] / N) * dc / sump
        
        # term frequency
        tf <- c(Nr[1] / N, r * dcr)
        p <- c(Nr[1] / N, r * dcr / N)
        names(p) <- names(tf) <- c(0, r)        
        
        list(p = p, r = tf)
}
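
To make the interface concrete, here is a toy call with hypothetical frequency-of-frequency counts (not taken from the corpus): r holds the observed counts and Nr the number of distinct n-grams seen exactly r times.

toySGT <- simpleGoodTuring(r = 1:5, Nr = c(100, 40, 20, 10, 5))
toySGT$p["0"]  # probability mass reserved for unseen n-grams (Nr[1] / N)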

3-gram Model using Katz’s back-off model

sgtFactory <- function() {
        NrTbl1 <- textstat_frequency(twitterDfm) %>%
                select(frequency) %>%
                mutate(freqOfFrequency = 1) %>%
                group_by(frequency) %>%
                summarise_all(sum)
        
        SGT1 <- simpleGoodTuring(NrTbl1$frequency, NrTbl1$freqOfFrequency)
        
        NrTbl2 <- textstat_frequency(twitterDfm2Gram) %>%
                select(frequency) %>%
                mutate(freqOfFrequency = 1) %>%
                group_by(frequency) %>%
                summarise_all(sum)
        
        SGT2 <- simpleGoodTuring(NrTbl2$frequency, NrTbl2$freqOfFrequency)

        NrTbl3 <- textstat_frequency(twitterDfm3Gram) %>%
                select(frequency) %>%
                mutate(freqOfFrequency = 1) %>%
                group_by(frequency) %>%
                summarise_all(sum)
        
        SGT3 <- simpleGoodTuring(NrTbl3$frequency, NrTbl3$freqOfFrequency)
        
        c(
                dUnigram = function(freq) {
                        SGT1$p[as.character(freq)]
                },
                dBigram = function(freq) {
                        SGT2$r[as.character(freq)] / freq
                },
                dTrigram = function(freq) {
                        SGT3$r[as.character(freq)] / freq
                }
        )
}

SGT <- sgtFactory()
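
Each element of SGT is a discount function keyed by n-gram order; for example, the following returns the coefficient by which the back-off model discounts raw trigram counts of 2 (a quick sanity check):

SGT$dTrigram(2)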

nextWords <- function(input, outputs = 3, k = 0) {
        # k is the least important of the parameters. It is usually chosen to be 0.
        # However, empirical testing may find better values for k.
        inputs <- str_split(input, "\\s+")[[1]]
        inputsSize <- length(inputs)
        if (inputsSize > 1) {
                preTriGram <- paste(inputs[inputsSize - 1],
                                 inputs[inputsSize],
                                 sep = "_")

                nextWordDfm <- dfm_select(twitterDfm3Gram, 
                                          paste0(preTriGram, "_*"))
        } else {
                if (inputs == "") { return() }
                nextWordDfm <- NULL
        }
        preBiGram <- inputs[inputsSize]

        # extract n-gram that starts with input
        featuresNextWord <- NULL

        if (length(nextWordDfm) > k) {
                prevWordDfm <- dfm_select(twitterDfm2Gram,
                                          phrase(preTriGram))

                prevWordFreq <- textstat_frequency(prevWordDfm)$frequency

                # data frame
                featuresNextWord <-
                        textstat_frequency(nextWordDfm) %>%
                        mutate(p_bo = SGT$dTrigram(frequency) * frequency / prevWordFreq)

                # human readable outputs
                featuresNextWord$feature <-
                        sapply(as.vector(featuresNextWord$feature),
                               function(x) {
                                       str_split(x, "_")[[1]][3]
                               })
                
                # Sort by reverse frequency order
                featuresNextWord$feature <-
                        with(featuresNextWord,
                             reorder(feature,-p_bo))
                
        } else {
                nextWordDfm <- dfm_select(twitterDfm2Gram,
                                          paste0(preBiGram, "_*"))
                
                if (length(nextWordDfm) > k) {
                        prevWordDfm <- dfm_select(twitterDfm,
                                                  phrase(preBiGram))
                        prevWordFreq <- textstat_frequency(prevWordDfm)$frequency
        
                        # data frame
                        featuresNextWord <-
                                textstat_frequency(nextWordDfm) %>%
                                mutate(p_bo = SGT$dBigram(frequency) * frequency / prevWordFreq)
        
                        # human readable outputs
                        featuresNextWord$feature <-
                                sapply(as.vector(featuresNextWord$feature),
                                       function(x) {
                                               str_split(x, "_")[[1]][2]
                                       })
                        
                        # Sort by reverse frequency order
                        featuresNextWord$feature <-
                                with(featuresNextWord,
                                     reorder(feature,-p_bo))
                        
                } else {
                        nextWordDfm <- twitterDfm
                        featuresNextWord <-
                                textstat_frequency(nextWordDfm) %>%
                                mutate(p_bo = SGT$dUnigram(frequency))
        
                        # Sort by reverse frequency order
                        featuresNextWord$feature <-
                                with(featuresNextWord,
                                     reorder(feature,-p_bo))
                                
                }
        }
        
        featuresNextWord %>% slice(1:outputs)
}

Prediction Example

Predict the three most likely next words after "I went to be":

ggplot(nextWords("I went to be"), aes(x = feature, y = p_bo)) +
        geom_bar(stat = "identity") + 
        xlab("Next word") + ylab("P_bo")

ggplot(nextWords("aaaaa bbbbb zzzzz"), aes(x = feature, y = p_bo)) +
        geom_bar(stat = "identity") + 
        xlab("Next word") + ylab("P_bo")

References